In Indonesia, rainfall is one crucial triggering factor for landslides. This paper 
aims to build landslide event prediction models using several machine 
learning and artificial intelligence algorithms. The algorithms were trained 
with two different methods. The input of the algorithms was precipitation data 
obtained from Global Satellite Mapping of Precipitation (GSMaP) satellite 
observations, and the target was landslide event occurrence data obtained from 
the Indonesian National Board for Disaster Management. Each algorithm 
provided several model candidates with different parameter settings for each 
method. As a result, there were 52 and 72 model candidates for the first and 
second methods, respectively. The best model was then chosen from each method. 
The results show that the model generated by the generalized linear model was 
the best for the first method, and the one generated by deep learning was the 
best for the second. The best models of the two methods achieved 0.828 and 0.836 
for the area under the receiver operating characteristics curve, and their 
log-loss values were 0.156 and 0.154, respectively. The second 
method, which used input data transformation, provided better performance. 


This is an open access article under the CC BY-SA license. 
Corresponding Author: 

Hastuadi Harsa 
Research and Development Center, Indonesia Meteorology Climatology and Geophysics Agency 
Jl. Angkasa I No. 2, Kemayoran, Jakarta Pusat, DKI Jakarta, 10720, Indonesia 
Email: hastuadi@gmail.com 


1. INTRODUCTION 

Classification is one of the significant tasks for machine learning (ML) and artificial intelligence (AI) 
[1]-[6]. The objective of classification is to isolate regions in the input feature space. The isolated regions 
are then labeled using guidance called the target. Numerous algorithms with different approaches have been 
developed to determine how points in the input feature space are associated with target labels. ML is designed to 
search for this relationship using statistical schemes, while AI uses nature-inspired ones. This paper applies both 
ML and AI to classify whether a set of rainfall properties will potentially trigger a landslide. The 
classifier is a binomial classification model, chosen as the best among many model 
candidates provided by ML and AI. The models were trained to recognize patterns in the rainfall properties data 
so that they were able to distinguish which combinations of feature values would trigger a landslide. 

Landslides are a devastating phenomenon for humans [7]-[13]. Several factors, e.g., climate conditions, 
human activities, geological structure, earthquakes, and topography, may trigger landslides. Rainfall is also a 
significant factor in landslide occurrences. Furthermore, as a maritime continent, Indonesia has abundant 
rainfall [14], [15]. Therefore, understanding the relationship between rainfall characteristics and landslides 
can contribute to the early warning of natural disasters in mitigation and management readiness. The integration 
of ML or AI with geographic information systems (GIS) in landslide analysis has been studied previously [16]-[21]. 
Several machine learning algorithms for landslide mapping were compared in [22]-[24]. The 
area under the receiver operating characteristics curve (AUC) was the main metric used to assess model 
performance in the literature [25]-[27]. The objectives of this paper are: i) assessing ML and AI performance in 
classifying whether a set of rainfall properties would trigger a landslide event; and ii) testing whether a data 
transformation prior to ML and AI model building can increase the performance of the models. 


2. METHOD 
2.1. Data 

The data used for the response variable in this study were reports of landslide events, given as 
the dates on which landslides occurred from 2014 until 2018. These data were taken from the Indonesian National 
Board for Disaster Management (BNPB). The study area was Banjarnegara, a regency in the southwestern part 
of Central Java province in Indonesia. According to BNPB data, Banjarnegara regency is 
the most landslide-prone area in Central Java. There were 84 records of landslides out of 1826 dates in the data. 
Each date was labeled "Yes" if a landslide occurred and "No" otherwise. 

The availability of surface rainfall observation stations is minimal and unevenly distributed in most 
developing countries with highland and mountainous topography. This condition makes the forecasting of 
disasters and other extreme hydrometeorological phenomena challenging [28]. These 
limitations make satellite rainfall estimation data a promising alternative [29]. Given this 
circumstance, the explanatory variables in this research were derived from the global satellite mapping of 
precipitation (GSMaP) data. The data are presented in spatio-temporal form. The spatial resolution 
of the data was 0.1°×0.1°, and the temporal resolution was hourly. We took the values from the GSMaP 
grid cell at 7°15'S and 109°45'E, whose coverage area includes Banjarnegara. 
We performed a pre-processing procedure to aggregate the 
GSMaP values from hourly to daily. The pre-processing procedure resulted in three explanatory variables: i) 
accumulation: the total amount of rainfall on a date (mm); ii) intensity: the average 
rainfall amount per hour for a particular date (mm/hour); and iii) duration: the total number of hours in which 
rainfall happened on a date (hour). 
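
The hourly-to-daily aggregation described above can be sketched as follows. This is a minimal illustration, not the authors' pre-processing code. The function name `daily_rainfall_features`, the `wet_threshold` parameter, and the sample values are hypothetical. The sketch also assumes that intensity is averaged over rainy hours only, which the text does not state explicitly.

```python
from collections import defaultdict
from datetime import datetime


def daily_rainfall_features(hourly, wet_threshold=0.0):
    """Aggregate (timestamp, rain rate in mm/hour) pairs into daily features.

    Returns {date: (accumulation_mm, intensity_mm_per_hour, duration_hours)}.
    """
    by_date = defaultdict(list)
    for stamp, rate in hourly:
        by_date[stamp.date()].append(rate)

    features = {}
    for day, rates in sorted(by_date.items()):
        # Hours counted as rainy: rate strictly above the threshold (assumption).
        wet = [r for r in rates if r > wet_threshold]
        accumulation = sum(wet)   # each hourly rate covers one hour, so the sum is in mm
        duration = len(wet)       # number of rainy hours on the date
        # Assumption: intensity is the mean rate over rainy hours only.
        intensity = accumulation / duration if duration else 0.0
        features[day] = (accumulation, intensity, duration)
    return features


# Hypothetical hourly values for two dates.
hourly = [
    (datetime(2014, 1, 1, 0), 0.0),
    (datetime(2014, 1, 1, 1), 2.0),
    (datetime(2014, 1, 1, 2), 4.0),
    (datetime(2014, 1, 2, 0), 0.0),
]
features = daily_rainfall_features(hourly)
```

If intensity were instead averaged over all 24 hours, it would be a constant multiple of accumulation and would add no information, which is why the rainy-hours reading is assumed here.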

The dataset included three explanatory variables, i.e., accumulation, intensity, and duration of the 
rainfall, and one response variable, i.e., landslide occurrence. First, the records of the explanatory and the 
response variables were aligned by date. Next, we shifted all three explanatory variables by one day to 
construct lag-1 values relative to the landslide variable. Finally, the shifting process was repeated to provide the 
lag-2 and lag-3 values of all three main rainfall properties from the original dataset. At this stage, there were 
nine explanatory variables and one response variable. The nine explanatory variables were constructed from 
three rainfall properties one to three days before the landslide occurred. The dataset was composed of 
1826 rows × 10 columns in tabular form. The rows denoted the dates, and the columns denoted nine explanatory 
variables and one response variable. Thus, in prediction terms, the system would predict whether the next day 
from a particular day would be potentially susceptible to a landslide by considering three rainfall properties 
(accumulation, intensity, and duration) from that day and one to two days before. 
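
The construction of the nine lagged explanatory variables can be illustrated with the short sketch below. The function name `build_lagged_table`, the column naming scheme, and the sample rows are hypothetical. Dropping the first dates that lack a full three-day history is an assumption of this sketch, since the text does not say how those dates were handled.

```python
def build_lagged_table(rows, max_lag=3):
    """Pair each date's landslide label with rainfall properties of previous days.

    rows: list of dicts in date order with keys
          'accumulation', 'intensity', 'duration', and 'landslide'.
    Returns a list of dicts with 3 * max_lag lagged features plus the label.
    """
    names = ("accumulation", "intensity", "duration")
    table = []
    # Assumption: dates without max_lag previous days are skipped.
    for i in range(max_lag, len(rows)):
        record = {}
        for lag in range(1, max_lag + 1):
            for name in names:
                record[f"{name}_lag{lag}"] = rows[i - lag][name]
        record["landslide"] = rows[i]["landslide"]
        table.append(record)
    return table


# Hypothetical daily records for five consecutive dates.
rows = [
    {"accumulation": 0.0, "intensity": 0.0, "duration": 0, "landslide": "No"},
    {"accumulation": 12.0, "intensity": 3.0, "duration": 4, "landslide": "No"},
    {"accumulation": 40.0, "intensity": 5.0, "duration": 8, "landslide": "No"},
    {"accumulation": 5.0, "intensity": 2.5, "duration": 2, "landslide": "Yes"},
    {"accumulation": 0.0, "intensity": 0.0, "duration": 0, "landslide": "No"},
]
table = build_lagged_table(rows)
```

Each output record has ten entries, matching the nine explanatory variables and one response variable described above.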

There were two methods examined in this study. The first method presented the dataset to the models 
directly. The second method applied two additional procedures, log-transformation and principal component 
analysis (PCA), before presenting the dataset to the models. The datasets of both methods were standardized 
just before being processed by the models. In the operational phase of the second method, the values of the input 
variables were scaled and subsequently multiplied by the eigenvectors provided by PCA. In 
addition, we added 1 to all values before calculating the logarithm. The reason is that the minimum value of all 
explanatory variables was 0, and the logarithm of 0 is undefined (it tends to negative infinity). 
As a result, the minimum value of every variable became 1, and every value was greater by one. Although 
the addition of 1 changed the values, the data characteristics remained the same because the same constant 
was added to every value without exception. A concise 
summary of the overall workflow is displayed in Figure 1. The data preparation phase, at the left part of the 
figure, consists of three main sections following the input data. The model building phase, at the right part of the 
figure, also has three sections after the data have been provided by the data preparation phase. 
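
The transformation steps of the second method (adding 1, taking the logarithm, scaling, and projecting onto PCA eigenvectors) can be sketched as below. This is only an illustration under stated assumptions: the function names `log_standardize` and `project` are hypothetical, the natural logarithm and the sample standard deviation are assumed because the text does not specify them, and the eigenvectors in the usage lines are placeholders rather than the ones obtained in the study.

```python
import math


def log_standardize(column):
    """Apply log(1 + x) to a column, then scale it to zero mean and unit variance."""
    logged = [math.log1p(v) for v in column]   # log(1 + x); maps 0 to 0
    mean = sum(logged) / len(logged)
    variance = sum((v - mean) ** 2 for v in logged) / (len(logged) - 1)
    std = math.sqrt(variance) or 1.0           # guard against a constant column
    return [(v - mean) / std for v in logged], mean, std


def project(row, eigenvectors):
    """Multiply one scaled row by each eigenvector (dot products)."""
    return [sum(x * w for x, w in zip(row, vec)) for vec in eigenvectors]


# Hypothetical rainfall accumulation values (mm), including a dry day.
scaled, mean, std = log_standardize([0.0, 1.0, 3.0, 7.0])

# Placeholder eigenvectors for a two-variable example.
scores = project([1.0, 2.0], [[1.0, 0.0], [0.0, 1.0]])
```

In an operational setting, the `mean` and `std` returned for the training data would be reused to scale new inputs before the projection.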


2.2. Model 

There were four ML algorithms and one AI algorithm employed in this study. They are distributed random forest 
(DRF) [30], generalized linear model (GLM) [31], extreme gradient boosting (XGBoost) [32], generalized 
boosting machine (GBM) [33], and deep learning (DL) [34]-[36]. These algorithms were trained as binomial 
classifiers as characterized by the training datasets of each method. However, the algorithms had different input 
variables for each method, as PCA in method-2 reduced the dataset's dimensionality. Several model candidates 
with the selected algorithms were built in each method. The model building phase utilized parallel computation 
with 15 cores and 20 gigabytes (GB) of memory. A maximum running time of two hours was set for 
each method. Each model candidate had parameter settings different from those of the other candidates. 
After building the model candidates, two stacked ensemble (SE) schemes were also carried out. The 
first SE was composed of every model candidate within the same algorithm. The second one was composed of 
all model candidates regardless of their algorithm. 

The model candidates, including SE models, were ranked by their area under the receiver operating 
characteristics curve (AUC). In addition, the log-loss metric was also reported. The 
performances (AUC and log-loss) of all models were obtained from cross-validated test data. We used 20 
folds, assigned with a random sampling scheme, for the model performance calculation. 
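
A random assignment of the 1826 dates to 20 cross-validation folds might look like the sketch below. The function name `random_folds` and the seed are hypothetical, and the shuffle-then-deal scheme is an assumption: it yields folds of nearly equal size, whereas the exact random sampling scheme used in the study is not described.

```python
import random


def random_folds(n_rows, n_folds=20, seed=0):
    """Shuffle row indices and deal them into n_folds nearly equal folds."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)   # seeded for reproducibility
    return [indices[k::n_folds] for k in range(n_folds)]


folds = random_folds(1826)

# Each fold serves once as test data while the other 19 folds train the model.
for test_fold in folds:
    held_out = set(test_fold)
    train_rows = [i for i in range(1826) if i not in held_out]
```

With only 84 landslide dates among 1826, some folds may contain very few "Yes" records under purely random assignment, so a stratified split could be worth considering.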
Figure 1. Workflow diagram 


3. RESULTS AND DISCUSSION 
3.1. Input data 

The initial distributions of the data are displayed in Figure 2(a) and their log-transformed counterparts in Figure 2(b). 
The landslide occurrence distributions are denoted by the red area and labeled as 'Yes', while those without 
landslides are denoted by the blue area and labeled as 'No' in both figures. It can be observed that after the 
transformation, the densities of values in each variable were more clearly visible. Thus, the characteristics of the 
data were also more pronounced. For method-2, the log-transformed data were subsequently fed into PCA. 
Each datum in the input data has nine variables. PCA re-mapped these data onto its own axes, called 
principal components (PCs), and there were also nine PCs in this case. These PCs were orthogonal to each 
other and named in order of decreasing variance (PC1, PC2, and so on). Because the PCs were 
orthogonal and ordered by variance, selecting only the first few was often adequate to retain most of the variance in the original data. 

The first three PCs already explained more than 90% of the cumulative data variance, as 
displayed in Figure 3(a). Therefore, we only used three variables for method-2 instead of nine as in method-1. 
One thing to note is that PCA's axes are orthogonal. Hence, the multicollinearity among the variables was 
suppressed, as can be observed from the location of data in the PCA coordinate system in Figure 3(b). 
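
The selection of the number of PCs from the cumulative variance explained can be sketched as follows. The function name `components_needed` and the eigenvalues are hypothetical, chosen only so that three components pass a 90% threshold. They are not the values behind Figure 3(a).

```python
def components_needed(eigenvalues, threshold=0.90):
    """Return the smallest number of PCs whose cumulative variance share
    reaches the threshold."""
    total = sum(eigenvalues)
    cumulative = 0.0
    for k, value in enumerate(sorted(eigenvalues, reverse=True), start=1):
        cumulative += value
        if cumulative / total >= threshold:
            return k
    return len(eigenvalues)


# Hypothetical eigenvalues of a nine-variable dataset.
eigenvalues = [5.0, 2.5, 1.0, 0.2, 0.1, 0.1, 0.05, 0.03, 0.02]
n_components = components_needed(eigenvalues)
```

The share of variance explained by each PC is its eigenvalue divided by the sum of all eigenvalues, which is the quantity accumulated in the loop.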


3.2. Model performance 

There were 57 models produced for method-1 and 72 for method-2 within the two-hour maximum running 
time of each method. The number of candidate models provided by each algorithm varied. Method-2 required 
less time to produce models since it had fewer explanatory variables than method-1. The performance of the 
models is represented by their location in log-loss and AUC coordinates, as displayed in Figure 4. AUC is a 
metric for assessing the performance of a binary (two-class) classifier model. It ranges between 0 and 1, where 0 means 
that the classifier model predicts every positive class as a negative class and vice versa. 
Figure 2. The input data density: (a) before and (b) after transformation 
On the other hand, a model having an AUC value of 1 perfectly predicts the 
actual class. If the AUC value of a model is close to 0.5, then the model behaves like random guessing, 
as in a coin toss where the probabilities of heads and tails are equal, i.e., 0.5. Log-loss is another performance 
metric for a binary classifier. It is the mean negative logarithm of the probability that the model assigns to the 
actual class of each datum. The lower the log-loss value, the better the model. 
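
Both metrics can be computed directly from predicted probabilities, as in the minimal sketch below. The function names and sample values are hypothetical. AUC is computed here by its pairwise-ranking definition (the fraction of positive-negative pairs in which the positive case receives the higher score, with ties counted as half), which is equivalent to the area under the ROC curve. Labels are encoded as 1 for "Yes" and 0 for "No".

```python
import math


def auc(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


def log_loss(labels, probabilities, eps=1e-15):
    """Mean negative log of the probability assigned to the actual class."""
    total = 0.0
    for y, p in zip(labels, probabilities):
        p = min(max(p, eps), 1.0 - eps)   # avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(labels)


# Hypothetical labels and predicted landslide probabilities.
labels = [0, 0, 1, 1]
probabilities = [0.1, 0.4, 0.35, 0.8]
model_auc = auc(labels, probabilities)
model_log_loss = log_loss(labels, probabilities)
```

A model that outputs 0.5 for every datum has a log-loss of ln 2 (about 0.693), which gives a reference point for reading the values reported for the best models.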
Figure 3. Data information retrieval using PCA: (a) variance explained by each principal 
component axis and (b) the location of input data points on the first three PC axes: PC1, PC2, and PC3 
Figure 4. The performance of candidate models in each method. The performances are displayed as locations 
on the AUC and log-loss axes, where a better model has a higher AUC and a lower log-loss 


The best model is located in the top-left position. The top-left position has a combination of higher 
AUC and lower log-loss. Other performance metrics for the best models of each method are also displayed in 
the legend boxes. The legend boxes provide information on the mean squared error (MSE), root mean squared 
error (RMSE), mean per class error (MPCE), area under the precision-recall curve (AUPCR), Gini index, and R-squared 
(R2). The performances of the models in method-2 are quantitatively better than those of method-1. This can be 
observed from the fact that method-2 has higher values than method-1 for the greater-the-better metrics. The 
contribution of each explanatory variable is presented in Table 1. Method-1 has nine variables, while method-2 
has three. The importance values are presented as percentages. For example, the best model in method-1 
considered that the rainfall duration on a particular date had the most significant contribution to landslide 
occurrence the next day. In method-2, variable PC2 was regarded as the most significant variable by the best 
model. 

Additionally, models in method-2 also have lower values for the lesser-the-better metrics. The log-loss 
values of models in method-2 are more concentrated at the lower part of the metric space. Therefore, it 
can be inferred that log-transformation and PCA were able to increase the models' performance. Figure 5 shows 
the ROC curve of the best model in each method. The AUC values are also displayed at the bottom-right of 
the figure. The ROC curve of method-2 encloses a slightly larger area than that of method-1. Method-2 is more 
efficient than method-1 because it used fewer explanatory variables, yet its AUC outperformed the AUC of 
method-1. All metrics are calculated as the mean over the 20 cross-validation datasets. Further research on 
model development may also be performed in the future. For example, since this work only assessed the 
influence of rainfall on landslide events, it is also essential to incorporate other variables such as soil type, 
land slope, and human activity. 
Figure 5. The ROC of the best models in both methods 


Table 1. Variable importance, denoted in percent 


Method-1                                      Method-2 
Variable             Importance percentage    Variable   Importance percentage 
Duration lag 1       22.57                    PC3        50.24 
Accumulation lag 1   18.73                    PC1        32.36 
Duration lag 2       14.05                    PC2        17.4 
Accumulation lag 3   12.1 
Intensity lag 1      9.64 
Accumulation lag 2   6.17 
Duration lag 3       5.8 
Intensity lag 3      5.58 
Intensity lag 2      5.35 


4. CONCLUSION 

Our study finds that proper pre-processing procedures for the input data increased ML and AI 
performance. We compared two ML and AI model development methods in the model development phase. 
The first one did not undergo pre-processing, while the second one did. The second method 
outperformed the first method while requiring fewer explanatory variables. We 
applied log-transformation and PCA, as pre-processing procedures, to our input data for the second method. 
The GLM provided the best model for the first method and DL for the second one. The models predict 
landslide occurrence one day ahead, given satellite-derived rainfall properties. Our work may contribute to 
the development of a rainfall-induced landslide warning system for disaster mitigation and management. 